LipNet: End-to-End Sentence-level Lipreading
نویسندگان
چکیده
Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end perform only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy in sentence-level, overlapped speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).
منابع مشابه
LipNet: Sentence-level Lipreading
Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequenc...
متن کاملUsing Surface-Learning to improve Speech Recognition with Lipreading
We explore multimodal recognition by combining visual lipreading with acoustic speech recognition. We show that combining the visual and acoustic clues of speech improves the recog nition performance significantly especially in noisy environment. We achieve this with a hybrid speech recognition architecture, consisting of a new visual learning and tracking mechanism, a channel robust acoustic ...
متن کاملLong-term training, transfer, and retention in learning to lipread.
A long-term training paradigm in lipreading was used to test the fuzzy logical model of perception (FLMP). This model has been used successfully to describe the joint contribution of audible and visible speech in bimodal speech perception. Tests of the model were extended in the present experiment to include the prediction of confusion matrices, as well as performance at several different level...
متن کاملLCANet: End-to-End Lipreading with Cascaded Attention-CTC
Machine lipreading is a special type of automatic speech recognition (ASR) which transcribes human speech by visually interpreting the movement of related face regions including lips, face, and tongue. Recently, deep neural network based lipreading methods show great potential and have exceeded the accuracy of experienced human lipreaders in some benchmark datasets. However, lipreading is still...
متن کاملIranian EFL Learners’ Lexical Inferencing Strategies at Both Text and Sentence levels
Lexical inferencing is one of the most important strategies in vocabulary learning and it plays an important role in dealing with unknown words in a text. In this regard, the aim of this study was to determine the lexical inferencing strategies used by Iranian EFL learners when they encounter unknown words at both text and sentence levels. To this end, forty lower intermediate students were div...
متن کامل